Data and Decision Insights – ADEO Technical Assessment

Table of Contents

  1. Import Relevant Packages
  2. Setup Notebook Configuration
  3. Load the Data Frames
  4. Perform EDA (Exploratory Data Analysis)
  5. Cluster Our Data Frame
  6. Feature (Column) Understanding
  7. Model Training and Evaluation

Import Relevant Packages:

The geodata is provided by © OpenStreetMap contributors and is made available here under the Open Database License (ODbL).

Setup Notebook Configuration:

Load the Data Frames:

Perform EDA (Exploratory Data Analysis)

Peek into the data:

*** current_df ***

  Country name Regional indicator  Ladder score  Standard error of ladder score  upperwhisker  lowerwhisker  Logged GDP per capita  Social support  Healthy life expectancy  Freedom to make life choices  Generosity  Perceptions of corruption  Ladder score in Dystopia  Explained by: Log GDP per capita  Explained by: Social support  Explained by: Healthy life expectancy  Explained by: Freedom to make life choices  Explained by: Generosity  Explained by: Perceptions of corruption  Dystopia + residual
0      Finland     Western Europe         7.842                           0.032         7.904         7.780                 10.775           0.954                     72.0                         0.949      -0.098                      0.186                      2.43                             1.446                         1.106                                  0.741                                       0.691                     0.124                                    0.481                3.253
1      Denmark     Western Europe         7.620                           0.035         7.687         7.552                 10.933           0.954                     72.7                         0.946       0.030                      0.179                      2.43                             1.502                         1.108                                  0.763                                       0.686                     0.208                                    0.485                2.868
2  Switzerland     Western Europe         7.571                           0.036         7.643         7.500                 11.117           0.942                     74.4                         0.919       0.025                      0.292                      2.43                             1.566                         1.079                                  0.816                                       0.653                     0.204                                    0.413                2.839
3      Iceland     Western Europe         7.554                           0.059         7.670         7.438                 10.878           0.983                     73.0                         0.955       0.160                      0.673                      2.43                             1.482                         1.172                                  0.772                                       0.698                     0.293                                    0.170                2.967
4  Netherlands     Western Europe         7.464                           0.027         7.518         7.410                 10.932           0.942                     72.4                         0.913       0.175                      0.338                      2.43                             1.501                         1.079                                  0.753                                       0.647                     0.302                                    0.384                2.798 

######################################################################################################################################################

    Country name  Regional indicator  Ladder score  Standard error of ladder score  upperwhisker  lowerwhisker  Logged GDP per capita  Social support  Healthy life expectancy  Freedom to make life choices  Generosity  Perceptions of corruption  Ladder score in Dystopia  Explained by: Log GDP per capita  Explained by: Social support  Explained by: Healthy life expectancy  Explained by: Freedom to make life choices  Explained by: Generosity  Explained by: Perceptions of corruption  Dystopia + residual
144      Lesotho  Sub-Saharan Africa         3.512                           0.120         3.748         3.276                  7.926           0.787                   48.700                         0.715      -0.131                      0.915                      2.43                             0.451                         0.731                                  0.007                                       0.405                     0.103                                    0.015                1.800
145     Botswana  Sub-Saharan Africa         3.467                           0.074         3.611         3.322                  9.782           0.784                   59.269                         0.824      -0.246                      0.801                      2.43                             1.099                         0.724                                  0.340                                       0.539                     0.027                                    0.088                0.648
146       Rwanda  Sub-Saharan Africa         3.415                           0.068         3.548         3.282                  7.676           0.552                   61.400                         0.897       0.061                      0.167                      2.43                             0.364                         0.202                                  0.407                                       0.627                     0.227                                    0.493                1.095
147     Zimbabwe  Sub-Saharan Africa         3.145                           0.058         3.259         3.030                  7.943           0.750                   56.201                         0.677      -0.047                      0.821                      2.43                             0.457                         0.649                                  0.243                                       0.359                     0.157                                    0.075                1.205
148  Afghanistan          South Asia         2.523                           0.038         2.596         2.449                  7.695           0.463                   52.493                         0.382      -0.102                      0.924                      2.43                             0.370                         0.000                                  0.126                                       0.000                     0.122                                    0.010                1.895 

######################################################################################################################################################

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 20 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   Country name                                149 non-null    object 
 1   Regional indicator                          149 non-null    object 
 2   Ladder score                                149 non-null    float64
 3   Standard error of ladder score              149 non-null    float64
 4   upperwhisker                                149 non-null    float64
 5   lowerwhisker                                149 non-null    float64
 6   Logged GDP per capita                       149 non-null    float64
 7   Social support                              149 non-null    float64
 8   Healthy life expectancy                     149 non-null    float64
 9   Freedom to make life choices                149 non-null    float64
 10  Generosity                                  149 non-null    float64
 11  Perceptions of corruption                   149 non-null    float64
 12  Ladder score in Dystopia                    149 non-null    float64
 13  Explained by: Log GDP per capita            149 non-null    float64
 14  Explained by: Social support                149 non-null    float64
 15  Explained by: Healthy life expectancy       149 non-null    float64
 16  Explained by: Freedom to make life choices  149 non-null    float64
 17  Explained by: Generosity                    149 non-null    float64
 18  Explained by: Perceptions of corruption     149 non-null    float64
 19  Dystopia + residual                         149 non-null    float64
dtypes: float64(18), object(2)
memory usage: 23.4+ KB
None 

######################################################################################################################################################

*** historic_df ***

  Country name  year  Life Ladder  Log GDP per capita  Social support  Healthy life expectancy at birth  Freedom to make life choices  Generosity  Perceptions of corruption  Positive affect  Negative affect
0  Afghanistan  2008        3.724               7.370           0.451                             50.80                         0.718       0.168                      0.882            0.518            0.258
1  Afghanistan  2009        4.402               7.540           0.552                             51.20                         0.679       0.190                      0.850            0.584            0.237
2  Afghanistan  2010        4.758               7.647           0.539                             51.60                         0.600       0.121                      0.707            0.618            0.275
3  Afghanistan  2011        3.832               7.620           0.521                             51.92                         0.496       0.162                      0.731            0.611            0.267
4  Afghanistan  2012        3.783               7.705           0.521                             52.24                         0.531       0.236                      0.776            0.710            0.268 

######################################################################################################################################################

     Country name  year  Life Ladder  Log GDP per capita  Social support  Healthy life expectancy at birth  Freedom to make life choices  Generosity  Perceptions of corruption  Positive affect  Negative affect
1944     Zimbabwe  2016        3.735               7.984           0.768                              54.4                         0.733      -0.095                      0.724            0.738            0.209
1945     Zimbabwe  2017        3.638               8.016           0.754                              55.0                         0.753      -0.098                      0.751            0.806            0.224
1946     Zimbabwe  2018        3.616               8.049           0.775                              55.6                         0.763      -0.068                      0.844            0.710            0.212
1947     Zimbabwe  2019        2.694               7.950           0.759                              56.2                         0.632      -0.064                      0.831            0.716            0.235
1948     Zimbabwe  2020        3.160               7.829           0.717                              56.8                         0.643      -0.009                      0.789            0.703            0.346 

######################################################################################################################################################

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1949 entries, 0 to 1948
Data columns (total 11 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Country name                      1949 non-null   object 
 1   year                              1949 non-null   int64  
 2   Life Ladder                       1949 non-null   float64
 3   Log GDP per capita                1913 non-null   float64
 4   Social support                    1936 non-null   float64
 5   Healthy life expectancy at birth  1894 non-null   float64
 6   Freedom to make life choices      1917 non-null   float64
 7   Generosity                        1860 non-null   float64
 8   Perceptions of corruption         1839 non-null   float64
 9   Positive affect                   1927 non-null   float64
 10  Negative affect                   1933 non-null   float64
dtypes: float64(9), int64(1), object(1)
memory usage: 167.6+ KB
None 

######################################################################################################################################################
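The code cell behind this output is not part of the export. A minimal sketch of a helper that would print a comparable head/tail/summary view is shown below; the function name `peek`, the `SEPARATOR` width, and the demo frame are assumptions made for illustration. The stray `None` lines in the output above are consistent with `df.info()` having been wrapped in `print`, since `info()` prints its summary itself and returns `None`.

```python
import pandas as pd

SEPARATOR = "#" * 150  # assumed width; the original separator length is not documented


def peek(df: pd.DataFrame, name: str, n: int = 5) -> None:
    """Print the first and last n rows of df, then its column summary."""
    print(f"*** {name} ***\n")
    print(df.head(n), "\n")
    print(SEPARATOR, "\n")
    print(df.tail(n), "\n")
    print(SEPARATOR, "\n")
    # info() writes to stdout and returns None, so it is called without print()
    df.info()


# Tiny demo frame built from three rows shown in the output above
demo_df = pd.DataFrame(
    {
        "Country name": ["Finland", "Denmark", "Afghanistan"],
        "Ladder score": [7.842, 7.620, 2.523],
    }
)
peek(demo_df, "demo_df", n=2)
```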

Clean column names as necessary:

*** current_df ***

Old column names: ['Country name' 'Regional indicator' 'Ladder score' 'Standard error of ladder score' 'upperwhisker' 'lowerwhisker' 'Logged GDP per capita' 'Social support' 'Healthy life expectancy' 'Freedom to make life choices' 'Generosity' 'Perceptions of corruption' 'Ladder score in Dystopia' 'Explained by: Log GDP per capita' 'Explained by: Social support' 'Explained by: Healthy life expectancy' 'Explained by: Freedom to make life choices' 'Explained by: Generosity' 'Explained by: Perceptions of corruption' 'Dystopia + residual']
New column names: ['Country_Name' 'Regional_Indicator' 'Ladder_Score' 'Standard_Error_Of_Ladder_Score' 'Upperwhisker' 'Lowerwhisker' 'Logged_Gdp_Per_Capita' 'Social_Support' 'Healthy_Life_Expectancy' 'Freedom_To_Make_Life_Choices' 'Generosity' 'Perceptions_Of_Corruption' 'Ladder_Score_In_Dystopia' 'Explained_By:_Log_Gdp_Per_Capita' 'Explained_By:_Social_Support' 'Explained_By:_Healthy_Life_Expectancy' 'Explained_By:_Freedom_To_Make_Life_Choices' 'Explained_By:_Generosity' 'Explained_By:_Perceptions_Of_Corruption' 'Dystopia_+_Residual'] 

######################################################################################################################################################

*** historic_df ***

Old column names: ['Country name' 'year' 'Life Ladder' 'Log GDP per capita' 'Social support' 'Healthy life expectancy at birth' 'Freedom to make life choices' 'Generosity' 'Perceptions of corruption' 'Positive affect' 'Negative affect']
New column names: ['Country_Name' 'Year' 'Life_Ladder' 'Log_Gdp_Per_Capita' 'Social_Support' 'Healthy_Life_Expectancy_At_Birth' 'Freedom_To_Make_Life_Choices' 'Generosity' 'Perceptions_Of_Corruption' 'Positive_Affect' 'Negative_Affect'] 

######################################################################################################################################################
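The old and new names above are consistent with title-casing each name and replacing spaces with underscores. The sketch below reproduces that mapping under this assumption; `clean_column_names` is a hypothetical helper, not necessarily the function used in the notebook.

```python
import pandas as pd


def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with Title_Case, underscore-separated column names."""
    cleaned = (
        df.columns.str.strip()
        .str.title()
        .str.replace(" ", "_", regex=False)
    )
    return df.rename(columns=dict(zip(df.columns, cleaned)))


# Empty demo frame using a few of the original column names
demo = pd.DataFrame(
    columns=[
        "Country name",
        "Logged GDP per capita",
        "Explained by: Generosity",
        "Dystopia + residual",
        "year",
    ]
)
print(list(clean_column_names(demo).columns))
```

Note that punctuation such as `:` and `+` survives this transformation, as it does in the output above; stripping it would be a separate design choice.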

Check the data frames for duplicated records (rows):

False
False


######################################################################################################################################################

False
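The export does not say which check produced each `False`. One plausible reading is a full-row check per data frame plus a key-level check on the historic data; the sketch below shows both kinds of check, with `has_duplicates` as a hypothetical helper and a three-row demo frame taken from the historic output above.

```python
import pandas as pd


def has_duplicates(df: pd.DataFrame, subset=None) -> bool:
    """True if any row repeats an earlier row (optionally comparing only `subset`)."""
    return bool(df.duplicated(subset=subset).any())


historic_demo = pd.DataFrame(
    {
        "Country_Name": ["Afghanistan", "Zimbabwe", "Zimbabwe"],
        "Year": [2008, 2019, 2020],
        "Life_Ladder": [3.724, 2.694, 3.160],
    }
)

# Full-row duplicates: none in this demo
print(has_duplicates(historic_demo))
# Duplicates on the natural key (one row per country and year): none
print(has_duplicates(historic_demo, subset=["Country_Name", "Year"]))
# A country alone is expected to repeat across years
print(has_duplicates(historic_demo, subset=["Country_Name"]))
```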

Enrich the dataset by combining the current and historic data frames into one:

     Country_Name  Regional_Indicator  Year  Happiness_Index  Logged_Gdp_Per_Capita  Social_Support  Healthy_Life_Expectancy  Freedom_To_Make_Life_Choices  Generosity  Perceptions_Of_Corruption
0     Afghanistan          South Asia  2008            3.724                  7.370           0.451                   50.800                         0.718       0.168                      0.882
1     Afghanistan          South Asia  2009            4.402                  7.540           0.552                   51.200                         0.679       0.190                      0.850
2     Afghanistan          South Asia  2010            4.758                  7.647           0.539                   51.600                         0.600       0.121                      0.707
3     Afghanistan          South Asia  2011            3.832                  7.620           0.521                   51.920                         0.496       0.162                      0.731
4     Afghanistan          South Asia  2012            3.783                  7.705           0.521                   52.240                         0.531       0.236                      0.776
...           ...                 ...   ...              ...                    ...             ...                      ...                           ...         ...                        ...
2030     Zimbabwe  Sub-Saharan Africa  2017            3.638                  8.016           0.754                   55.000                         0.753      -0.098                      0.751
2031     Zimbabwe  Sub-Saharan Africa  2018            3.616                  8.049           0.775                   55.600                         0.763      -0.068                      0.844
2032     Zimbabwe  Sub-Saharan Africa  2019            2.694                  7.950           0.759                   56.200                         0.632      -0.064                      0.831
2033     Zimbabwe  Sub-Saharan Africa  2020            3.160                  7.829           0.717                   56.800                         0.643      -0.009                      0.789
2034     Zimbabwe  Sub-Saharan Africa  2021            3.145                  7.943           0.750                   56.201                         0.677      -0.047                      0.821

[2035 rows x 10 columns]
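The merge logic itself is not shown in the export. The output suggests that the current (2021) rows were appended to the historic rows, that the score column was unified as `Happiness_Index`, and that each historic row received its region from the current data. The row count also hints at a filter: 1949 + 149 = 2098, yet the combined frame has 2035 rows, so 63 historic rows appear to have been dropped, plausibly those for countries with no entry (and hence no region) in `current_df`. The sketch below follows that reading; the function `combine` and the country "Atlantis" are hypothetical.

```python
import pandas as pd

current = pd.DataFrame(
    {
        "Country_Name": ["Zimbabwe", "Afghanistan"],
        "Regional_Indicator": ["Sub-Saharan Africa", "South Asia"],
        "Ladder_Score": [3.145, 2.523],
        "Logged_Gdp_Per_Capita": [7.943, 7.695],
    }
)
historic = pd.DataFrame(
    {
        # "Atlantis" is a made-up country that has no row in `current`
        "Country_Name": ["Zimbabwe", "Zimbabwe", "Atlantis"],
        "Year": [2019, 2020, 2020],
        "Life_Ladder": [2.694, 3.160, 5.0],
        "Log_Gdp_Per_Capita": [7.950, 7.829, 9.0],
    }
)


def combine(current: pd.DataFrame, historic: pd.DataFrame, current_year: int = 2021) -> pd.DataFrame:
    """Stack historic and current rows under one schema, keeping only countries with a known region."""
    cur = current.rename(columns={"Ladder_Score": "Happiness_Index"}).assign(Year=current_year)
    hist = historic.rename(
        columns={"Life_Ladder": "Happiness_Index", "Log_Gdp_Per_Capita": "Logged_Gdp_Per_Capita"}
    )
    # Attach the region to each historic row; the inner join drops unknown countries
    regions = current[["Country_Name", "Regional_Indicator"]].drop_duplicates()
    hist = hist.merge(regions, on="Country_Name", how="inner")
    columns = ["Country_Name", "Regional_Indicator", "Year", "Happiness_Index", "Logged_Gdp_Per_Capita"]
    combined = pd.concat([hist[columns], cur[columns]], ignore_index=True)
    return combined.sort_values(["Country_Name", "Year"]).reset_index(drop=True)


combined = combine(current, historic)
print(combined)
```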

Fix the data frame's column datatypes as necessary:

Changed the datatype of column "Country_Name" to "category"
Changed the datatype of column "Regional_Indicator" to "category"
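Converting low-cardinality text columns to the pandas `category` dtype is a one-line operation per column. A minimal sketch, assuming the two columns named in the messages above are the only ones converted:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Country_Name": ["Afghanistan", "Afghanistan", "Zimbabwe"],
        "Regional_Indicator": ["South Asia", "South Asia", "Sub-Saharan Africa"],
        "Year": [2008, 2009, 2021],
    }
)

for column in ["Country_Name", "Regional_Indicator"]:
    # Repeated strings are stored once as categories plus small integer codes
    df[column] = df[column].astype("category")
    print(f'Changed the datatype of column "{column}" to "category"')

print(df.dtypes)
```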

Take a quick glance at the modified data's statistics:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2035 entries, 0 to 2034
Data columns (total 10 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   Country_Name                  2035 non-null   category
 1   Regional_Indicator            2035 non-null   category
 2   Year                          2035 non-null   int64   
 3   Happiness_Index               2035 non-null   float64 
 4   Logged_Gdp_Per_Capita         2011 non-null   float64 
 5   Social_Support                2026 non-null   float64 
 6   Healthy_Life_Expectancy       1984 non-null   float64 
 7   Freedom_To_Make_Life_Choices  2005 non-null   float64 
 8   Generosity                    1959 non-null   float64 
 9   Perceptions_Of_Corruption     1931 non-null   float64 
dtypes: category(2), float64(7), int64(1)
memory usage: 138.9 KB
None 

######################################################################################################################################################

              Year  Happiness_Index  Logged_Gdp_Per_Capita  Social_Support  Healthy_Life_Expectancy  Freedom_To_Make_Life_Choices   Generosity  Perceptions_Of_Corruption
count  2035.000000      2035.000000            2011.000000     2026.000000              1984.000000                   2005.000000  1959.000000                1931.000000
mean   2013.826536         5.490948               9.391096        0.814959                63.695212                      0.748269    -0.002346                   0.746277
std       4.514250         1.107523               1.141129        0.116125                 7.376080                      0.139289     0.162257                   0.186760
min    2005.000000         2.375000               6.635000        0.291000                32.300000                      0.258000    -0.335000                   0.035000
25%    2010.000000         4.669000               8.484000        0.751000                59.180000                      0.656000    -0.117000                   0.690000
50%    2014.000000         5.420000               9.487000        0.836000                65.400000                      0.769000    -0.029000                   0.801000
75%    2018.000000         6.298000              10.370500        0.906750                68.800000                      0.861000     0.089000                   0.870000
max    2021.000000         8.019000              11.648000        0.987000                77.100000                      0.985000     0.698000                   0.983000

Visualize the number of missing values in each column of our data frame:

"Perceptions_Of_Corruption" has the most missing values (104 of 2035 rows), so visualize which countries contribute most of them:

Visualize which countries have the most missing values, grouped by region:
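The plots for these three views did not survive the export. The counts behind them can be sketched as follows; the demo values are toy numbers, and each resulting Series could be passed to a bar plot (for example `Series.plot.barh()`).

```python
import numpy as np
import pandas as pd

# Toy frame: the missing-value pattern is invented for illustration
df = pd.DataFrame(
    {
        "Country_Name": ["China", "China", "Kosovo", "Kosovo"],
        "Regional_Indicator": [
            "East Asia",
            "East Asia",
            "Central and Eastern Europe",
            "Central and Eastern Europe",
        ],
        "Healthy_Life_Expectancy": [69.0, 69.3, np.nan, np.nan],
        "Perceptions_Of_Corruption": [np.nan, np.nan, 0.9, np.nan],
    }
)

# 1. Missing values per column, worst column first
missing_per_column = df.isna().sum().sort_values(ascending=False)
worst_column = missing_per_column.index[0]

# 2. Which countries contribute the gaps in the worst column
missing_by_country = (
    df[worst_column].isna().groupby(df["Country_Name"]).sum().sort_values(ascending=False)
)

# 3. Total missing cells per region, across all feature columns
feature_columns = df.columns.difference(["Country_Name", "Regional_Indicator"])
missing_by_region = df[feature_columns].isna().sum(axis=1).groupby(df["Regional_Indicator"]).sum()

print(missing_per_column, missing_by_country, missing_by_region, sep="\n\n")
```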

Imputation: Fill in the missing values for each column using the appropriate techniques:

Column "Perceptions_Of_Corruption" is almost completely missing for country "China"
Column "Healthy_Life_Expectancy" is almost completely missing for country "Hong Kong S.A.R. of China"
Column "Healthy_Life_Expectancy" is almost completely missing for country "Kosovo"
Column "Perceptions_Of_Corruption" is almost completely missing for country "Turkmenistan"


######################################################################################################################################################

Does our data frame contain any missing values? False
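The export does not document which imputation techniques were used. The messages above suggest a per-country step with a fallback for countries whose series is almost entirely empty. One possible scheme, offered purely as an assumption, is to interpolate within each country's time series and fall back to the regional median where a country has nothing to interpolate from. The function `impute`, the 0.8 threshold, and the demo countries "A" and "B" are all hypothetical.

```python
import numpy as np
import pandas as pd


def impute(df: pd.DataFrame, columns, threshold: float = 0.8) -> pd.DataFrame:
    """Fill gaps per country by interpolation, then by regional and global medians."""
    out = df.sort_values(["Country_Name", "Year"]).copy()
    for column in columns:
        # Report countries whose series is (almost) entirely missing
        missing_share = out[column].isna().groupby(out["Country_Name"]).mean()
        for country in missing_share[missing_share >= threshold].index:
            print(f'Column "{column}" is almost completely missing for country "{country}"')
        # Step 1: linear interpolation inside each country's own time series
        out[column] = out.groupby("Country_Name")[column].transform(
            lambda s: s.interpolate(limit_direction="both")
        )
        # Step 2: regional median for countries with no usable values at all
        regional_median = out.groupby("Regional_Indicator")[column].transform("median")
        out[column] = out[column].fillna(regional_median)
        # Step 3: global median as a last resort
        out[column] = out[column].fillna(out[column].median())
    return out


demo = pd.DataFrame(
    {
        "Country_Name": ["A", "A", "A", "B", "B"],
        "Regional_Indicator": ["R1"] * 5,
        "Year": [2018, 2019, 2020, 2018, 2019],
        "Social_Support": [0.70, np.nan, 0.80, np.nan, np.nan],
    }
)
filled = impute(demo, ["Social_Support"])
print("Does our data frame contain any missing values?", bool(filled.isna().any().any()))
```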

Cluster Our Data Frame Based on the Happiness_Index to Discover Underlying (Latent) Happiness Levels (Clusters)
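Neither the clustering algorithm nor the number of clusters is recorded in the export. As an illustration only, the sketch below uses k-means from scikit-learn on the single `Happiness_Index` column with three clusters, then relabels the clusters so that level 0 is the least happy. The column name `Happiness_Level`, the helper `add_happiness_level`, and the two mid-range scores in the demo are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def add_happiness_level(df: pd.DataFrame, n_clusters: int = 3, seed: int = 0) -> pd.DataFrame:
    """Cluster on Happiness_Index and add an ordinal Happiness_Level column (0 = lowest)."""
    values = df[["Happiness_Index"]].to_numpy()
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(values)
    # k-means labels are arbitrary, so rank clusters by their centre
    order = np.argsort(model.cluster_centers_.ravel())
    rank_of_label = {int(label): rank for rank, label in enumerate(order)}
    levels = [rank_of_label[int(label)] for label in model.labels_]
    return df.assign(Happiness_Level=levels)


# Scores from the tables above, plus two invented mid-range values (5.4, 5.5)
demo = pd.DataFrame({"Happiness_Index": [2.523, 3.145, 3.415, 5.4, 5.5, 7.571, 7.620, 7.842]})
clustered = add_happiness_level(demo)
print(clustered)
```

Ranking the clusters by centre makes the labels comparable across runs, which matters if the level is later used as a grouping variable in plots.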

Feature (Column) Understanding

Perform univariate feature analysis:

Break down how countries from the different happiness levels make up that column's distribution:

Dive deeper into that column's statistics:
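The univariate plots are missing from the export. A numeric counterpart to such a breakdown can be sketched with a group-wise summary; the `Happiness_Level` labels below are hypothetical stand-ins for the cluster labels, while the `Social_Support` values come from the tables above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Country_Name": ["Finland", "Denmark", "Rwanda", "Afghanistan"],
        "Happiness_Level": ["High", "High", "Low", "Low"],  # assumed labels
        "Social_Support": [0.954, 0.954, 0.552, 0.463],
    }
)
column = "Social_Support"

# Share of rows contributed by each happiness level
level_share = df["Happiness_Level"].value_counts(normalize=True)

# Per-level summary statistics of the chosen column
level_stats = df.groupby("Happiness_Level")[column].describe()

print(level_share)
print(level_stats[["count", "mean", "min", "max"]])
```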

Perform bivariate feature analysis:
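The bivariate plots are likewise absent. A simple numeric starting point, assuming Pearson correlation is an acceptable summary, is the correlation of each numeric feature with the target. The ten rows below are the top five and bottom five countries from the `current_df` output, so the result overstates the relationship compared with the full data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Happiness_Index": [7.842, 7.620, 7.571, 7.554, 7.464, 3.512, 3.467, 3.415, 3.145, 2.523],
        "Logged_Gdp_Per_Capita": [10.775, 10.933, 11.117, 10.878, 10.932, 7.926, 9.782, 7.676, 7.943, 7.695],
        "Social_Support": [0.954, 0.954, 0.942, 0.983, 0.942, 0.787, 0.784, 0.552, 0.750, 0.463],
    }
)

# Pearson correlation between every pair of numeric columns
correlations = df.select_dtypes(include="number").corr()

# Keep only each feature's correlation with the target
with_target = correlations["Happiness_Index"].drop("Happiness_Index").sort_values(ascending=False)
print(with_target)
```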

Model Training and Evaluation

Feature Engineering: Prepare the data so machine learning models can consume it for training:

Unique values of column "Regional_Indicator": ['Central and Eastern Europe' 'Commonwealth of Independent States' 'East Asia' 'Latin America and Caribbean' 'Middle East and North Africa' 'North America and ANZ' 'South Asia' 'Southeast Asia' 'Sub-Saharan Africa' 'Western Europe']
Unique values of column "Year": [2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021]
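How these two columns were encoded is not shown. A common approach, assumed here, is to one-hot encode the ten-valued `Regional_Indicator`, keep `Year` as a numeric feature, and drop `Country_Name` because of its high cardinality. The sketch uses `pandas.get_dummies` on three rows taken from the combined data frame above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Country_Name": ["Afghanistan", "Zimbabwe", "Zimbabwe"],
        "Regional_Indicator": ["South Asia", "Sub-Saharan Africa", "Sub-Saharan Africa"],
        "Year": [2012, 2020, 2021],
        "Happiness_Index": [3.783, 3.160, 3.145],
        "Logged_Gdp_Per_Capita": [7.705, 7.829, 7.943],
    }
)

# The regression target
target = df["Happiness_Index"]

# Drop the identifier and the target, then expand the region into 0/1 indicator columns
features = pd.get_dummies(
    df.drop(columns=["Country_Name", "Happiness_Index"]),
    columns=["Regional_Indicator"],
    dtype=float,
)
print(features.columns.tolist())
```

For linear models, dropping one indicator column (`drop_first=True`) avoids perfect collinearity with the intercept; tree-based models do not need this.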

Start training our models:

Model: linear
	Model's CV average R2 score: 73.8%
	Model's CV average MSE loss value: 0.3848
	Model's test R2 score: 77.8%

Model: XGBoost
	Model's CV average R2 score: 75.07%
	Model's CV average MSE loss value: 0.3823
	Model's test R2 score: 85.8%

Model: hist_boosting
	Model's CV average R2 score: 75.82%
	Model's CV average MSE loss value: 0.3793
	Model's test R2 score: 84.81%

Model: random_forest
	Model's CV average R2 score: 75.46%
	Model's CV average MSE loss value: 0.3837
	Model's test R2 score: 86.76%

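The training loop is not included in the export. The reported metrics (cross-validated average R2 and MSE, plus an R2 score on a held-out test set) can be produced with scikit-learn roughly as sketched below. The data here is synthetic, the fold count and split ratio are assumptions, and only two of the four model families are shown; XGBoost is omitted to keep the example free of extra dependencies.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic regression data standing in for the engineered features
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([0.8, 0.5, 0.3, 0.0]) + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validation on the training split only
    cv = cross_validate(
        model, X_train, y_train, cv=5, scoring=("r2", "neg_mean_squared_error")
    )
    # Refit on the full training split, then score once on the held-out test split
    model.fit(X_train, y_train)
    results[name] = {
        "cv_r2": cv["test_r2"].mean(),
        "cv_mse": -cv["test_neg_mean_squared_error"].mean(),  # scorer returns negated MSE
        "test_r2": model.score(X_test, y_test),
    }
    print(f"Model: {name}")
    print(f"\tModel's CV average R2 score: {results[name]['cv_r2']:.2%}")
    print(f"\tModel's CV average MSE loss value: {results[name]['cv_mse']:.4f}")
    print(f"\tModel's test R2 score: {results[name]['test_r2']:.2%}")
```

In the results above, every model's test R2 is higher than its cross-validated average. With repeated yearly rows per country, a random split can place the same country in both training and test data, so a group-aware splitter such as `GroupKFold` keyed on country may be worth considering as a check.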